An LLM safety monitor's evaluator can be tricked into clearing dangerous sessions when the attacker plants fake analysis text in the monitored conversation. Experimental results, defense limits, and structural separation points.
A security researcher bypassed Claude Opus 4.6's policy evaluation with just four short prompts, generating attack code against live infrastructure. Plus 915 files exfiltrated from the sandbox.
GitHub releases the layered defense design of the agent execution platform, and OpenAI releases the instruction hierarchy training data IH-Challenge and model. Responses to prompt injection were received from both infrastructure design and training axes.